Abstract

This paper presents a simulation-based framework for sequential inference from partially and
discretely observed point process (PP) models with static parameters. Taking a Bayesian perspective
for the static parameters, we build upon sequential Monte Carlo (SMC) methods, investigating the
problems of performing sequential filtering and smoothing in complex examples, where current
methods often fail. We consider various approaches for approximating posterior distributions using
SMC. Our approaches, with some theoretical discussion, are illustrated on a doubly stochastic point
process applied in the context of finance.

1 Introduction

Partially observed point processes provide a rich class of models to describe real data. For example, such
models are used for stochastic volatility (Barndorff-Nielsen & Shephard, 2001) in finance, descriptions of
queuing data in operations research (Fearnhead, 2004), important seismological models (Daley &
Vere-Jones, 1988) and applications in nuclear physics (Snyder & Miller, 1998). For complex dynamic models,
that is, when data arrive sequentially in time, studies date back to at least Snyder (1972). However,
fitting Bayesian models requires SMC (e.g. Doucet et al. (2000)) and Markov chain Monte Carlo (MCMC)
methods. The main developments in this field include the work of Centanni & Minozzo (2006a, b); Green
(1995); Del Moral et al. (2006, 2007); Doucet et al. (2006); Roberts et al. (2004); Rydberg & Shephard
(2000); see also Whiteley et al. (2011). As we describe below, the SMC methodology may fail in some
scenarios, and we describe methodology to deal with the problems that will be outlined.

Informally, the problem of interest is as follows. A process is observed discretely upon a given
time-interval $[0,T]$. The objective is to draw inference at time-points $t_0 = 0 < t_1 < \cdots < t_m < T = t_{m+1}$, on
the unobserved marked PP $(k_{t_n}, \phi_{1:k_{t_n}}, \zeta_{1:k_{t_n}})$, where $\phi_{1:k_{t_n}} = (\phi_1, \dots, \phi_{k_{t_n}})$ are the ordered event times
(constrained to $[0,t_n]$) and $\zeta_{1:k_{t_n}} = (\zeta_1, \dots, \zeta_{k_{t_n}})$ are marks, given the observations $y_{1:r_{t_n}}$. In other words,
to compute, for $n \geq 1$, at time $t_n$:

(1)  $\pi_n(k_{t_n}, \phi_{1:k_{t_n}}, \zeta_{1:k_{t_n}} \mid y_{1:r_{t_n}})$   smoothing

(2)  $\pi_n(k_{t_n} - k_{t_{n-1}}, \phi_{k_{t_{n-1}}+1:k_{t_n}}, \zeta_{k_{t_{n-1}}+1:k_{t_n}} \mid y_{1:r_{t_n}})$   filtering.

In addition, there are static parameters specifying the probability model and these parameters will be
estimated in a Bayesian manner. At this stage a convention in our terminology is established. An algorithm 
is said to be sequential if it is able to process data as it arrives over time. An algorithm is said to be on-line 
if it is sequential and has a fixed computational cost per iteration/time-step. 

One of the first works applying computational methods to PP models was Rydberg & Shephard (2000).
They focus upon a Cox model where the unobserved PP parameterizes the intensity of the observations. 
Rydberg & Shephard (2000) used the auxiliary particle filter (Pitt & Shephard, 1997) to simulate from 
the posterior density of the intensity at a given time point. This was superseded by Centanni & Minozzo 
(2006a, b), which allows one to infer the intensity at any given time, up to the current observation. Cen- 
tanni & Minozzo (2006a, b) perform an MCMC-type filtering algorithm, estimating static parameters using 
stochastic EM. The methodology cannot easily be adapted to the case where the static parameters are 
given a prior distribution. In addition, the theoretical validity of the approach has not been established; 
this is verified in Proposition [1] of this article. 

SMC samplers (Del Moral et al. 2006) are the focus of this paper and can be applied to all the problems
stated above. SMC methods simulate a set of $N \geq 1$ weighted samples, termed particles, in order to
approximate a sequence of distributions, which may be chosen by the user, but which include (or are closely
related to) the distributions in (1) and (2). Such methods are provably convergent as $N \to \infty$ (Del Moral,
2004). A key feature of the approach is that the user must select:

1. the sequence of distributions

2. the mechanism by which particles are propagated.

If points 1. and 2. are not properly addressed, there can be a substantial discrepancy between the proposal 
and target; thus the variance of the weights will be large and estimation inaccurate. This issue is particularly 
relevant when the targets are defined on a sequence of nested spaces, as is the case for the PP models - the 
space of the point process trajectories becomes larger with the time-parameter n. Thus, in choosing the 
sequence of target distributions, we are faced with the question of how much the space should be enlarged 
at each iteration of the SMC algorithm and how to choose a mechanism to propose particles in the new 
region of the space. This issue is referred to as the difficulty of extending the space. 

Two solutions are proposed. The first is to saturate the state-space; it is supposed that the observation 
interval, $[0,T]$, of the PP is known a priori. The sequence of target distributions is then defined on the
whole interval and one sequentially introduces likelihood terms. This idea circumvents the problem of
extending the space, at an extra computational cost. Inference for the original density of interest can
be achieved by importance sampling (IS). This approach cannot be used if $T$ is unknown. In the second
approach, entitled data-point tempering, the sequence of target distributions is defined by sequentially
introducing likelihood terms. This is achieved as follows: given that the PP has been sampled on $[0,t_n]$, the
target is extended onto $[0,t_{n+1}]$ by sampling the missing part of the PP. Then one introduces likelihood
terms into the target that correspond to the data (as in Chopin (2002)). Once all of the data have been
introduced, the target density is (1). It should be noted that neither of the methods is online, but some
simple fixes are detailed.

Section 2 introduces a doubly stochastic PP model from finance which serves as a running example. In
Section 3 the ideas of Centanni & Minozzo (2006a, b) are discussed; it is established that the method is
theoretically valid under some assumptions. The difficulty of extending the state space is also demonstrated.
In Section 4 we introduce our SMC methods. In Section 5 our methods are illustrated on the running
example. In Section 6 we detail extensions to our work.

Some notations are introduced. We consider a sequence of probability measures $\{\varpi_n\}_{1 \le n \le m^*}$ on spaces
$\{(G_n, \mathcal{G}_n)\}_{1 \le n \le m^*}$, with dominating $\sigma$-finite measures. Bounded and measurable functions on $G_n$,
$f_n : G_n \to \mathbb{R}$, are written $\mathcal{B}_b(G_n)$ and $\|f_n\| = \sup_{x \in G_n} |f_n(x)|$. $\varpi_n$ will refer to either the probability measure
$\varpi_n(dx)$ or density $\varpi_n(x)$.

2 Model 

The model we use to illustrate our ideas is from statistical finance. An important type of financial data is 
ultra high frequency data which consists of the irregularly spaced times of financial transactions and their 
corresponding monetary value. Standard models for the fitting of such data have relied upon stochastic 
differential equations driven by Wiener dynamics; a debatable assumption due to the continuity of the 
sample paths. As noted in Centanni & Minozzo (2006b), it is more appropriate to model the data as a Cox 
process. Due to the high frequency of the data, it is important to be able to perform sequential/on-line 
inference. Data are observed in [0, T]. In the context of finance, the assumption that T be fixed is entirely 
reasonable. For example, when the model is used in the context of equities, the model is run for the 
trading day; indeed due to different (deterministic) patterns in financial trading, it is likely that the fixed 
parameters below are varied according to the day. 

A marked PP, of $r_T \geq 1$ points, is observed in time-period $[0,T]$. This is written $y_{1:r_T} = (\omega_{1:r_T}, \xi_{1:r_T}) \in
\Omega_{r_T,T} \times \Xi^{r_T}$ with $\Omega_{r_T,T} = \{\omega_{1:r_T} : 0 < \omega_1 < \cdots < \omega_{r_T} < T\}$ and $\Xi = \mathbb{R}$. Here the $\omega_i$ are the transaction times
and the $\xi_i$ are the log-returns on the financial transactions. An appropriate model for such data, as in Centanni
& Minozzo (2006b), is

$p(\omega_{1:r_T} \mid \{\lambda_t\}) \propto \prod_{i=1}^{r_T} \{\lambda_{\omega_i}\} \exp\left\{-\int_0^T \lambda_u \, du\right\}$

with $p$ a generic density, the $\xi_i$ are assumed to be t-distributed on 1 degree of freedom, location $\mu$,
scale $\sigma$ and $\lambda_u$ is the intensity. The unobserved intensity process is assumed to follow the dynamics
$d\lambda_t = -s\lambda_t \, dt + dJ_t$ with $\{J_t\}$ a compound Poisson process: $J_t = \sum_{j=1}^{k_t} \zeta_j$ with $\{k_t\}$ a Poisson process
with rate parameter $\nu$ and i.i.d. jumps $\zeta_j \sim \mathcal{E}x(1/\gamma)$, where $\mathcal{E}x(\cdot)$ is the exponential distribution. That is, for
$t \in [0,T]$,

(3)  $\lambda_t = \lambda_0 e^{-st} + \sum_{j=1}^{k_t} \zeta_j e^{-s(t - \phi_j)}$

with $\phi_j$ the jump times of the unobserved Poisson process and $\lambda_0$ fixed throughout (using a short
preliminary time series that is available in practice).
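
As an illustration of (3), the following minimal Python sketch evaluates the intensity for a given set of jump times and marks, together with the integral $\int_0^T \lambda_u \, du$ that appears in the Cox likelihood. The closed form used for the integral, $\lambda_0 (1 - e^{-sT})/s + \sum_{j : \phi_j \le T} \zeta_j (1 - e^{-s(T - \phi_j)})/s$, follows from integrating each exponential term of (3). The function names and argument layout are assumptions made for this illustration and are not taken from the paper.

```python
import math


def intensity(t, lam0, s, jump_times, jump_sizes):
    """Evaluate lambda_t = lam0*exp(-s*t) + sum over phi_j <= t of zeta_j*exp(-s*(t - phi_j))."""
    value = lam0 * math.exp(-s * t)
    for phi, zeta in zip(jump_times, jump_sizes):
        if phi <= t:  # only jumps that have already occurred contribute
            value += zeta * math.exp(-s * (t - phi))
    return value


def integrated_intensity(T, lam0, s, jump_times, jump_sizes):
    """Closed form of the integral of lambda_u over [0, T]."""
    total = lam0 * (1.0 - math.exp(-s * T)) / s
    for phi, zeta in zip(jump_times, jump_sizes):
        if phi <= T:
            total += zeta * (1.0 - math.exp(-s * (T - phi))) / s
    return total


def cox_loglik(event_times, T, lam0, s, jump_times, jump_sizes):
    """Log of prod_i lambda_{omega_i} * exp(-int_0^T lambda_u du), up to an additive constant."""
    log_terms = sum(
        math.log(intensity(w, lam0, s, jump_times, jump_sizes)) for w in event_times
    )
    return log_terms - integrated_intensity(T, lam0, s, jump_times, jump_sizes)
```

Because the integral is available in closed form, evaluating the likelihood for a candidate trajectory costs one pass over the observed ticks and one pass over the jumps.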
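We define the following notation:

$x_n = (k_{t_n}, \phi_{1:k_{t_n}}, \zeta_{1:k_{t_n}})$,
$x_{n,1} = (k_{t_n} - k_{t_{n-1}}, \phi_{k_{t_{n-1}}+1:k_{t_n}}, \zeta_{k_{t_{n-1}}+1:k_{t_n}})$,
$y_n = (\omega_{1:r_{t_n}}, \xi_{1:r_{t_n}})$,
$y_{n,1} = (\omega_{r_{t_{n-1}}+1:r_{t_n}}, \xi_{r_{t_{n-1}}+1:r_{t_n}})$.

Here $x_n$ (respectively $y_n$) is the restriction of the hidden (observed) PP to events in $[0,t_n]$. Similarly
$x_{n,1}$ (respectively $y_{n,1}$) is the restriction of the hidden (observed) PP to events in $[t_{n-1},t_n]$.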

The objective is to perform inference at times $0 < t_1 < \cdots < t_m < T = t_{m+1}$, that is, to update the
posterior distribution conditional on the data arriving in $[t_{n-1}, t_n]$. To summarize, the posterior distribution
at time $t_n$ is

$\pi_n(x_n, \mu, \sigma \mid y_n) \propto \prod_{i=1}^{r_{t_n}} \{p(\xi_i; \mu, \sigma) \lambda_{\omega_i}\} \exp\left\{-\int_0^{t_n} \lambda_u \, du\right\} \times \prod_{i=1}^{k_{t_n}} \{p(\zeta_i)\} \, p(\phi_{1:k_{t_n}} \mid k_{t_n}) \, p(k_{t_n}) \times p(\mu, \sigma)$

(4)  $= l_{[0,t_n]}(y_n; x_n, \mu, \sigma) \times p(x_n) \times p(\mu, \sigma)$

with $l_{[0,t_n]}$ corresponding to the first part of the equation above, $\mu \sim \mathcal{N}(\alpha_\mu, \beta_\mu)$, $\sigma \sim \mathcal{G}a(\alpha_\sigma, \beta_\sigma)$, $\phi_{1:k_{t_n}} \mid k_{t_n} \sim
\mathcal{U}_{\Phi_{k,t_n}}$, $k_{t_n} \sim \mathcal{P}o(\nu t_n)$ and where $\mathcal{U}_A$ is the uniform distribution on the set $A$, $\mathcal{N}(\mu, \sigma^2)$ is the normal
distribution of mean $\mu$ and variance $\sigma^2$, $\mathcal{G}a(\alpha, \beta)$ the Gamma distribution of mean $\alpha/\beta$ and $\mathcal{P}o$ is the
Poisson distribution. $p(x_n)$ is the notation for the prior on the marked point-process and $p(\mu, \sigma)$ is the
notation for the prior on $(\mu, \sigma)$. Later a $\pi_0$ is introduced which will refer to an initial distribution. Note
it is possible to perform inference on $(\mu, \sigma)$ independently of the unobserved PP; it will not significantly
complicate the simulation methods to include them.

It is of interest to compute expectations w.r.t. the $\{\pi_n\}_{1 \le n \le m^*}$, and this is possible, using the SMC
methods below (Section 3.1). However, such algorithms are not of fixed computational cost; the sequence
of spaces over which the $\{\pi_n\}_{1 \le n \le m^*}$ lie is increasing. These methods can also be used to draw inference
from the marginal posterior of the process, over $(t_{n-1}, t_n]$; such algorithms can be designed to be of fixed
computational complexity, for example by constraining any simulation to a fixed-size state-space. This
idea is considered further in Section 4.3.



3 Previous Approaches 

One of the approaches for performing filtering for partially observed PPs is from Centanni & Minozzo
(2006a). In this Section the parameters $(\mu, \sigma)$ are assumed known. Let

$E_n = \bigcup_{k \in \mathbb{N}_0} \left( \{k\} \times \Phi_{k,t_n} \times (\mathbb{R}^+)^k \right)$

where $\Phi_{k,t_n}$ denotes the set of ordered event times $0 < \phi_1 < \cdots < \phi_k < t_n$.
This is the support of the target densities for this method.
The following decomposition is adopted

(5)  $\pi_n(x_n \mid y_n) = \frac{l_{(t_{n-1},t_n]}(y_{n,1}; x_n)}{p_n(y_{n,1} \mid y_{n-1})} \, p(x_{n,1}) \, \pi_{n-1}(x_{n-1} \mid y_{n-1})$

$p_n(y_{n,1} \mid y_{n-1}) = \int l_{(t_{n-1},t_n]}(y_{n,1}; x_n) \, p(x_{n,1}) \, \pi_{n-1}(x_{n-1} \mid y_{n-1}) \, dx_n.$

At time $n \geq 2$ of the algorithm, a reversible jump MCMC kernel (although the analysis below is not
restricted to such scenarios) is used for $N$ steps to sample from the approximated density

$\pi_n^N(x_n \mid y_n) \propto l_{(t_{n-1},t_n]}(y_{n,1}; x_n) \, p(x_{n,1}) \, S^N_{x,n-1}(x_{n-1})$

where $S^N_{x,n-1}(x_{n-1}) := \frac{1}{N} \sum_{i=1}^N \delta_{X_{n-1}^{(i)}}(x_{n-1})$ with $X_{n-1}^{(1)}, \dots, X_{n-1}^{(N)}$ obtained from a reversible jump
MCMC algorithm of invariant measure $\pi^N_{n-1}$. The algorithm for $n = 1$ targets $\pi_1$ exactly; there is no
empirical density $S^N_{x,0}$. At time $n = 1$ the algorithm starts from an arbitrary point $x_1^{(1)} \in E_1$ and
subsequent steps are initialized by a draw from the empirical $S^N_{x,n-1}$ and the prior $p$ (this can be modified);
$N - 1$ additional samples are simulated.

The above algorithm can be justified, theoretically, by using the Poisson equation (e.g. Glynn & Meyn 
(1996)) and induction arguments. Below the assumption (A) is made; see the appendix for the assumption 
(A) as well as the proof. Also, the expectation below is w.r.t. the process discussed above, given the 
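observed data.

Proposition 1. Assume (A). Then for any $n \geq 1$, $y_n$, $p \geq 1$ there exists $B_{p,n}(y_n) < +\infty$ such that for
any $f_n \in \mathcal{B}_b(E_n)$

(6)  $\mathbb{E}_{x_1^{(1)}}\left[ \left| \frac{1}{N} \sum_{i=1}^N f_n(X_n^{(i)}) - \int_{E_n} f_n(x_n) \pi_n(x_n \mid y_n) \, dx_n \right|^p \,\Big|\, y_n \right]^{1/p} \leq \frac{B_{p,n}(y_n) \|f_n\|}{\sqrt{N}}.$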



This result helps to establish the theoretical validity of the method in Centanni & Minozzo (2006a),
which, to our knowledge, had not been established in that paper or elsewhere. In addition, it allows us to
understand where and when the method may be of use; this is discussed in Section 3.2.



3.1 SMC Methods 

SMC samplers aim to approximate a sequence of related probability measures $\{\pi_n\}_{0 \le n \le m^*}$ defined upon
a common space $(E, \mathcal{E})$. Note that $m^* \geq 1$ can depend upon the data and may not be known prior to
simulation. For partially observed PPs the probability measures are defined upon nested state-spaces: this
case can be similarly handled with minor modification. SMC samplers introduce a sequence of auxiliary
probability measures $\{\tilde{\pi}_n\}_{0 \le n \le m^*}$ on state-spaces of increasing dimension $(E_{[0,n]} := E_0 \times \cdots \times E_n, \mathcal{E}_{[0,n]} :=
\mathcal{E}_0 \otimes \cdots \otimes \mathcal{E}_n)$, such that they admit the $\{\pi_n\}_{0 \le n \le m^*}$ as marginals.
The following sequence of auxiliary densities is used:

(7)  $\tilde{\pi}_n(x_{0:n}) = \pi_n(x_n) \prod_{j=0}^{n-1} L_j(x_{j+1}, x_j)$

where $\{L_n\}_{0 \le n \le m^*-1}$ are backward Markov kernels. In our application $\pi_0$ is the prior, on $E_1$ (as defined
below). It is clear that (7) admit the $\{\pi_n\}$ as marginals, and hence these distributions can be targeted
using precisely the same mechanism as in sequential importance sampling/resampling; the algorithm is
given in Figure 1.

The ESS in Figure 1 refers to the effective sample size (Liu, 2001). This measures the weight degeneracy
of the algorithm; if the ESS is close to $N$, then this indicates that all of the samples are approximately
independent. This is a standard metric by which to assess the performance of the algorithm. The resampling
method used throughout the paper is systematic resampling.
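
The ESS criterion and systematic resampling can be sketched as follows. This is a generic textbook implementation rather than code from the paper; the function names are hypothetical, and the weights need not be normalised.

```python
import random


def effective_sample_size(weights):
    """ESS = (sum w)^2 / sum w^2: N for uniform weights, 1 when a single weight dominates."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def systematic_resample(weights, rng=random):
    """Return N ancestor indices using one uniform draw and N evenly spaced pointers."""
    n = len(weights)
    total = sum(weights)
    u = rng.random() / n  # single uniform offset in [0, 1/N)
    indices = []
    cumulative = weights[0] / total
    j = 0
    for i in range(n):
        pointer = u + i / n  # pointers u, u + 1/N, ..., u + (N-1)/N
        # advance to the first particle whose cumulative weight exceeds the pointer
        while pointer >= cumulative and j < n - 1:
            j += 1
            cumulative += weights[j] / total
        indices.append(j)
    return indices
```

With a threshold such as $N/2$, one would call `systematic_resample` only when `effective_sample_size` falls below the threshold and then reset the weights to uniform.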

One generic approach is to set $K_n$ as an MCMC kernel of invariant distribution $\pi_n$ and $L_{n-1}$ as
the reversal kernel $L_{n-1}(x_n, x_{n-1}) = \pi_n(x_{n-1}) K_n(x_{n-1}, x_n) / \pi_n(x_n)$, which we term the standard reversal
kernel. One can iterate the MCMC kernels; we use the positive integer $M$ to denote the number
of iterates. It is also possible to apply the algorithm when $K_n$ is a mixture of kernels; see Del Moral et
al. (2006) for details on the algorithm.

0. Set $n = 0$; for $i = 1, \dots, N$ sample $X_0^{(i)} \sim \eta_0$ and compute $w_0(X_0^{(i)}) \propto \pi_0(X_0^{(i)}) / \eta_0(X_0^{(i)})$.

1. Compute the normalized importance weights; if the ESS
   $= \left\{\sum_{i=1}^N W_n(X_{0:n}^{(i)})\right\}^2 / \sum_{i=1}^N \left\{W_n(X_{0:n}^{(i)})\right\}^2 < T(N)$ then resample the particles and set the
   importance weights to uniform. Set $n = n + 1$; if $n = m^* + 1$ stop.

2. For $i = 1, \dots, N$ sample $X_n^{(i)} \mid X_{n-1}^{(i)} = x_{n-1}^{(i)} \sim K_n(x_{n-1}^{(i)}, \cdot)$ and compute:

   (8)  $w_n(x_{n-1:n}^{(i)}) \propto \frac{\pi_n(x_n^{(i)}) \, L_{n-1}(x_n^{(i)}, x_{n-1}^{(i)})}{\pi_{n-1}(x_{n-1}^{(i)}) \, K_n(x_{n-1}^{(i)}, x_n^{(i)})}$,

   $W_n(x_{0:n}^{(i)}) = w_n(x_{n-1:n}^{(i)}) \, W_{n-1}(x_{0:n-1}^{(i)})$ and return to the start of 1.

Figure 1: A Generic SMC Sampler. Note that $T(N)$ is termed a threshold function such that $1 \le T(N) \le N$
and ESS is the effective sample size.

3.1.1 Nested Spaces 

As described in Section 1, in complex problems it is often difficult to design efficient SMC algorithms. In
the example in Section 2, the state-spaces of the subsequent densities are not common. The objective is
to sample from a sequence of densities on the space, at time $n$,

$E_n = \left( \bigcup_{k \in \mathbb{N}_0} \{k\} \times \Phi_{k,t_n} \times (\mathbb{R}^+)^k \right) \times \mathbb{R} \times \mathbb{R}^+, \quad 1 \le n \le m^* - 1$

with $E_0 = E_1$. That is, for any $1 \le n \le m^* - 1$, $E_n \subseteq E_{n+1}$. Two standard methods for extending the
space, as in Del Moral et al. (2006), are to propagate particles by application of 'birth' and 'extend'
moves.

Consider the model in Section 2. The following SMC steps are used to extend the space at time $n$ of
the algorithm.

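• Birth. A new jump is sampled uniformly in $[\phi_{k_{t_{n-1}}}, t_n]$ and a new mark from the prior. The
incremental weight is

$w_n(x_{n-1:n}, \mu, \sigma) \propto \frac{\pi_n(x_n, \mu, \sigma \mid y_n) \, (t_n - \phi_{k_{t_{n-1}}})}{\pi_{n-1}(x_{n-1}, \mu, \sigma \mid y_{n-1}) \, p(\zeta_{k_{t_n}})}$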

• Extend. A new jump is generated according to a Markov kernel that corresponds to the random 



walk: 



log i hl^^^X ^^Z + log ^ ""'--^ - 



tn — i'kt^ J y tn — Ipk 
with $Z \sim \mathcal{N}(0, 1)$, $\vartheta > 0$. The new mark is sampled from the prior. The backward kernel and
incremental weight are discussed in Del Moral et al. (2007), Section 4.3.

Note, as remarked in Whiteley et al. (2011), we need to be able to sample any number of births. With an
extremely small probability, a proposal from the prior is included to form a mixture kernel.

In addition to the above steps an MCMC sweep is included after the decision of whether or not to
resample the particles is taken (see step 1. of Figure 1): an MCMC kernel of invariant measure $\pi_n$ is
applied. The kernel is much the same as in Green (1995).

3.1.2 Simulation Experiment 

We applied the benchmark sampler, as detailed above, to some synthetic data in order to monitor the 
performance of the algorithm. Standard practice in the reporting of financial data is to represent the time 
of a trade as a positive real number, with the integer part representing the number of days passed since 
January 1st 1900 and the non-integer part representing the fraction of 24 hours that has passed during
that day; thus, one minute corresponds to an interval of length 1/1440. Therefore we use a synthetic data 
set with intensity of order of magnitude $10^3$. The ticks were generated from a specified intensity process
$\{\lambda_t\}$ that varied smoothly between three levels of constant intensity at $\lambda = 6000$, $\lambda = 2000$ and $\lambda = 4000$.
The log returns were sampled from the Cauchy-distribution, location /i = and scale a = 2.5 x 10^''. 
The entire data set was of size $r_T = 3206$, $[0, T] = [0, 0.9]$ with $t_n = n \times 0.003$. The intensity from which
they were generated had constant levels at 6000 in the interval [0.05,0.18]; at 4000 in the interval [0.51,0.68]; 
and at 2000 in the intervals [0.28,0.42] and [0.78,0.90]. 
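
A data set of this kind can be generated by thinning a homogeneous Poisson process. The sketch below is a hypothetical reconstruction: the plateau levels and intervals are those quoted above, whereas the linear ramps between plateaus, the level of 6000 on [0, 0.05), the function names and the seed are assumptions, since the exact smooth transition used in the experiment is not specified. Only tick times are simulated; the log-returns are omitted. Under these assumptions the integrated intensity over [0, 0.9] is 3250, which is of the same order as the 3206 ticks reported.

```python
import random

# Plateau levels quoted in the text; the linear ramps between them and the
# level on [0, 0.05) are assumptions made for this illustration only.
KNOTS = [(0.00, 6000.0), (0.18, 6000.0), (0.28, 2000.0), (0.42, 2000.0),
         (0.51, 4000.0), (0.68, 4000.0), (0.78, 2000.0), (0.90, 2000.0)]


def profile_intensity(t):
    """Piecewise-linear intensity passing through the plateau levels."""
    for (t0, v0), (t1, v1) in zip(KNOTS, KNOTS[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t outside [0, 0.9]")


def simulate_ticks(T=0.9, lam_max=6000.0, seed=1):
    """Simulate tick times on [0, T] by thinning a Poisson process of rate lam_max."""
    rng = random.Random(seed)
    t, ticks = 0.0, []
    while True:
        t += rng.expovariate(lam_max)  # next candidate from the dominating process
        if t > T:
            return ticks
        # accept the candidate with probability lambda(t) / lam_max
        if rng.random() * lam_max <= profile_intensity(t):
            ticks.append(t)
```

Times are in days, so the inference grid $t_n = n \times 0.003$ corresponds to steps of roughly 4.3 minutes.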

The sampler was implemented with all combinations $(M, N)$ for $N \in \{100, 1000\}$ and $M \in \{1, 5, 20\}$,
resampling whenever the effective sample size fell below $N/2$ (recall $N$ is the number of particles and $M$
the number of MCMC iterations). When performing statistical inference, the intensity (3) used parameters $\gamma = 0.001$,
$\nu = 150$ and $s = 20$.

It was found that for this SMC sampler, the system consistently collapses to a single particle represen- 
tation of the distribution of interest within an extremely short time period. That is, resampling is needed 
at almost every time step, which leads to an extremely poor representation of the target density. Figure [2] 
shows the ESS at each time step for a particular implementation. As can be seen, the algorithm behaves
extremely poorly for this model. 

3.2 Discussion 

We have reviewed two existing techniques for the Bayesian analysis of partially observed PPs. It should
be noted that there are other methods, for example in Varini (2007). In that paper, the intensity has a
finite number of functional forms and the uncertainty is related to the type of form at each inference time.

Figure 2: Effective Sample Size plots for the SMC sampler described in Figure 1, implemented with
$N = 1000$ particles and with $M = 5$ MCMC sweeps at each iteration. The dashed line indicates the
resampling threshold at $N/2 = 500$ particles; resampling is needed at 94.4% of the time steps.

The relative advantage of the approach of Centanni & Minozzo (2006a) is the fact that the state-space
need not be extended. On page 1586 of Centanni & Minozzo (2006a) the authors describe the
filtering/smoothing algorithm, for the process on the entire interval $[0,t_n]$ at time $n$; the theory discussed
in Proposition 1 suggests that this method is not likely to work well as $n$ grows. The bound, which is
perhaps a little loose, is, for $n \geq 2$,

$B_{p,n}(y_n) = \frac{2}{\epsilon_n(y_n)} [B_p + 1] + k_n B_{p,n-1}(y_{n-1})$

with $B_{p,1}(y_1) = \frac{2}{\epsilon_1(y_1)} [B_p + 1]$, $B_p$ a constant related to the Burkholder/Davis inequalities (e.g. Shiryaev
(1996)), $\epsilon_n(y_n) \in (0,1)$ and $k_n > 0$ a constant that is model/data dependent and which is possibly bigger
than 1. The bound indicates that the error can increase over time, even under the exceptionally strong
assumption (A) in the appendix. This is opposed to SMC methods which are provably stable, under
similar assumptions (and that the entire state is updated), as $n \to \infty$ (Del Moral, 2004). In other words,
whilst the approach of Centanni & Minozzo is useful in difficult problems, it is less general, with a potentially
slower convergence rate than SMC. Intuitively, it seems that the method of Centanni & Minozzo (2006a)
is perhaps only useful when considering the process on $(t_{n-1}, t_n]$, as the process further back in time is
not rejuvenated in any way. As a result, parameter estimation may not be very accurate. In addition, the
method cannot be extended to a sequential algorithm such that fully Bayesian inference is possible. As
noted above, SMC samplers can be used in such contexts, but require a computational budget that grows
with the time parameter $n$.

As mentioned above, SMC methods are provably stable under some conditions as the time parameter
grows. However, some remarks related to the method in Figure 1 can help to shed some light on the poor
behaviour in Section 3.1.2. Consider the scenario when one is interested in statistical inference on $[0,t_1]$.
Suppose for simplicity, one can write the posterior on this region as

(9)  $\pi_1(x_1 \mid y_{1:r_1}) \propto \prod_{i=1}^{r_1} g_i(y_i; x_1) \, p(x_1)$

for fixed $r_1, \mu, \sigma$. If one considers just pure importance sampling, then conditioning upon the data, one
can easily show that for any $\pi_1$-(square) integrable $f$ with $\int f(x_1) p(x_1) \, dx_1 = 0$, the asymptotic variance
in the associated Gaussian central limit theorem is lower-bounded by:



Then, for any mixing type sequence of data the asymptotic variance will, for some $f$ and in some scenarios,
grow without bound as $r_1$ grows; this is a very heuristic observation that requires further investigation.
Hence, given this discussion and our empirical experience, it seems that we require a new methodology, 
especially for complex problems. 

3.3 Possible Solutions to the problems of Extending the State-Space 

An important remark associated to the simulations in Section 3.1.2 is that it cannot be expected that
simply increasing the number of particles will necessarily yield a significantly better estimation procedure. The
algorithm completely crashes to a single particle and it seems that naively increasing computation will not
improve the simulations.

As discussed above, the inherent difficulty of sampling from the given sequence of distributions is that
of extending the state-space. It is known that conditional on all parameters except the final jump, the
optimal importance distribution is the full conditional density (Del Moral et al. 2006). In practice, for
many problems it is either not possible to sample from this density, or to evaluate it exactly (which is
required). In the case that it is possible to sample from the full conditional, but the normalizing constant
is unknown, the normalizing constant problem can be dealt with via the random weight idea (Rousset
& Doucet, 2006). In the context of this problem we found that the simulation from the full conditional
density of $\phi_{k_{t_n}}$ was difficult, to the extent that sensible rejection algorithms and approximations for the
random weight technique were extremely poor.



Another solution, in Del Moral et al. (2007), consists of stopping the algorithm when the effective
sample size (ESS) drops and using an additional SMC sampler to facilitate the extension of the
state-space. However, in this example, the ESS is so low that it cannot be expected to help. Given the above
discussion, it is clear that a new technique is required to sample from the sequence of distributions; two
ideas are presented below. One idea, in the context of estimating static parameters, that could be adopted
is SMC$^2$ (Chopin et al. 2011), which has appeared after the first versions of this article.



4 Proposed Methods 

In the following Section, two approaches are presented to deal with the problems in Section 3.1.2. First,
a state-space saturation approach, where sampling of PP trajectories is performed over a state space
corresponding to a fixed observation interval. Second, a data-point tempering approach. In this approach,
as the time parameter increases, the (artificial) target in the new region is simply the prior and the data
are then sequentially added to the likelihood, softening the state-space extension problem. Both of these
procedures use the basic structure of Figure 1 with some refinements that are mentioned in the text. As
for the algorithm in Figure 1, we add dynamic resampling steps; when MCMC kernels are used, one can
resample before sampling (see Del Moral et al. (2006) for details).

4.1 Saturating the State-Space 

A simple idea, which has been used in the context of reversible jump, is to saturate the state-space. The idea
relies upon knowing the observation period of the PP ($[0,T]$) a priori to the simulation. This is realistic
in a variety of applications. For example, in Section 2, often we may only be interested in performing
inference for a day of trading and thus can set $[0,T]$.

In detail, it is proposed to sample, in the case of the example in Section 2, from the sequence of target
densities defined on the space

(10)  $E = \left( \bigcup_{k \in \mathbb{N}_0} \{k\} \times \Phi_{k,T} \times (\mathbb{R}^+)^k \right) \times \mathbb{R} \times \mathbb{R}^+.$

The (marginal, that is in the sense of (7)) target densities are now, denoted with an $S$ as a super-script:

$\pi_n^S(x_n, \mu, \sigma \mid y_n) \propto \prod_{i=1}^{r_{t_n}} \{p(\xi_i; \mu, \sigma) \lambda_{\omega_i}\} \exp\left\{-\int_0^{t_n} \lambda_u \, du\right\} \times \prod_{i=1}^{k_T} \{p(\zeta_i)\} \, p^S(\phi_{1:k_T} \mid k_T) \, p^S(k_T) \times p(\mu, \sigma), \quad 1 \le n \le m^*$

where the prior on the point process is $\phi_{1:k_T} \mid k_T \sim \mathcal{U}_{\Phi_{k,T}}$, $k_T \sim \mathcal{P}o(\nu T)$. We then use, for $K_n$, an MCMC
kernel of invariant measure $\pi_n^S$ and the standard reversal kernel discussed in Section 3.1 for the backward
kernel. The initial distribution is the prior and the weight at time 0 is proportional to 1 for each particle.
The incremental weights at subsequent time-points are simply:

$w_n(x_{n-1}, \mu_{n-1}, \sigma_{n-1}) \propto \frac{\pi_n^S(x_{n-1}, \mu_{n-1}, \sigma_{n-1} \mid y_n)}{\pi_{n-1}^S(x_{n-1}, \mu_{n-1}, \sigma_{n-1} \mid y_{n-1})}, \quad 1 \le n \le m^*.$

Inference w.r.t. the original $\{\pi_n\}_{1 \le n \le m^*}$ can be performed via IS as the supports of the targets of interest
are contained within the proposals (i.e. via the targets of the saturated algorithm).

4.2 Data-Point Tempering 

A simple solution to the state-space extension problem, which allows data to be incorporated sequentially, 
albeit not being of fixed computational complexity is as follows. When the time parameter increases, the 
new part of the process is simulated according to the prior. Then each new data point is added to the 
likelihood in a sequential manner. In other words if there are n data points, then there are m* = n + m 
time-steps of the algorithm. 

To illustrate, consider only the scenario of the data in $[0,t_1]$, with $r_{t_1} > 0$. Then our sequence of
(marginal) targets are: $\pi_0(x_1, \mu, \sigma) = p(x_1) p(\mu) p(\sigma)$ and for $1 \le n \le r_{t_1}$



Then, when considering the extension of the point-process onto $[0,t_2]$, one has a (marginal) target that is:



i=l ^ > 

When one extends the state-space, we sample from the prior on the new segment, which leads to a unit
incremental weight (up to proportionality); no backward kernel is required here. Then, when adding data,
we simply use MCMC kernels to move the particles (the kernels as in Section 3.1.1) and the standard
reversal kernel discussed in Section 3.1 for the backward kernel. This leads to an incremental weight that
is the ratio of the consecutive densities at the previous state.
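
The data-adding steps can be sketched on a toy problem. The example below is not the point process model of Section 2: it assumes, purely for illustration, a scalar state with a standard normal prior and unit-variance Gaussian observations, so that the state-space is fixed and only the sequential introduction of data points is shown. Multinomial resampling and a random-walk Metropolis kernel are used for brevity; all names and tuning values are hypothetical. Because the incremental weight depends only on the previous state, the particles are reweighted and resampled before the MCMC move.

```python
import math
import random


def log_prior(x):
    return -0.5 * x * x  # N(0, 1) prior, up to an additive constant


def log_lik(y, x):
    return -0.5 * (y - x) ** 2  # N(x, 1) observation density, up to a constant


def ess(w):
    return sum(w) ** 2 / sum(v * v for v in w)


def tempering_smc(data, n_particles=500, n_mcmc=2, step=0.5, seed=0):
    """Data-point tempering for a toy scalar model; returns particles and normalised weights."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]  # initial target is the prior
    ws = [1.0 / n_particles] * n_particles
    for n, y in enumerate(data, start=1):
        # Incremental weight: ratio of consecutive targets at the current state,
        # which here is the likelihood of the newly added data point.
        ws = [w * math.exp(log_lik(y, x)) for w, x in zip(ws, xs)]
        total = sum(ws)
        ws = [w / total for w in ws]
        if ess(ws) < n_particles / 2:
            xs = rng.choices(xs, weights=ws, k=n_particles)  # multinomial, for brevity
            ws = [1.0 / n_particles] * n_particles

        def log_target(x):
            # current target: prior times the first n likelihood terms
            return log_prior(x) + sum(log_lik(yi, x) for yi in data[:n])

        # Move step: random-walk Metropolis kernel leaving the current target invariant.
        for _ in range(n_mcmc):
            for i, x in enumerate(xs):
                prop = x + step * rng.gauss(0.0, 1.0)
                if math.log(1.0 - rng.random()) < log_target(prop) - log_target(x):
                    xs[i] = prop
    return xs, ws
```

For this conjugate toy model the posterior mean after $n$ observations is $\sum_i y_i / (n + 1)$, which gives a simple check on the weighted particle average.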

The potential advantage of this idea is that, when extending the state-space, there is no extra data to
potentially complicate the likelihood. Thus, it is expected that if the prior does not propose a significant
number of new jumps, the incremental weights should be of relatively low variance. The subsequent
steps, when considering the jumps in $[t_n, t_{n+1})$, are performed on a common state-space and hence should
not be subject to as substantial variability as when the state-space changes. This idea could also be adapted
to the case that the likelihood on the new interval is tempered instead (e.g. Jasra et al. (2007)).

As a theoretical investigation of this idea, we return to the discussion of Section 3.2 and in particular,
where the joint target density is (9). We consider the data-point tempering which starts with a draw from
the prior and sequentially adds data points. In other words, it runs for $r_1 + 1$ time-steps with

$\pi_n(x_1 \mid y_{1:n}) \propto \left\{ \prod_{i=1}^{n} g_i(y_i; x_1) \right\} p(x_1), \quad 0 \le n \le r_1,$

with $-\infty < \underline{g} < \bar{g} < \infty$ such that for each $i$, $y_i$ and all $x_1$, $\underline{g} \le g_i(y_i; x_1) \le \bar{g}$. The algorithm resamples
at every time-step and uses MCMC kernels, which are assumed to satisfy, for some $\tau \in (0, 1)$, and each
$1 \le n \le r_1$, $y_{1:n}$, $x_1, x_1'$

$K_n(x_1, \cdot) \geq \tau K_n(x_1', \cdot).$

At the very final time-step one also resamples after the final weighting of the particles. Write $X_1^{(1)}, \dots, X_1^{(N)}$
as the samples that approximate target (9). Suppose $f \in \mathcal{B}_b(E_1)$, then there is a Gaussian central limit
theorem for

$\sqrt{N} \left( \frac{1}{N} \sum_{i=1}^N f(X_1^{(i)}) - \int f(x_1) \pi_1(x_1 \mid y_{1:r_1}) \, dx_1 \right).$

Writing the asymptotic variance as $\sigma^2_{\mathrm{SMC}}(f)$, we have the following result whose proof is in the appendix.

Proposition 2. For the SMC sampler described above, with final target (9), we have for any $f \in \mathcal{B}_b(E_1)$
that there exists a $B \in (0, +\infty)$ such that for any $r_1 \geq 1$, $y_{1:r_1}$

$\sigma^2_{\mathrm{SMC}}(f) \leq B.$

The upper-bound does not grow with the number of data. That is, by increasing the computational 
complexity linearly in the number of data, one has an algorithm whose error does not grow as more data 
(and regions) are added. This is similar to the observation of Beskos et al. (2011), when increasing the 
dimension of the target density. We note that the result is derived under exceptionally strong assumptions. 
In general, when one considers ri growing, one requires sharper tools than the Dobrushin coefficients used 
here (e.g. Eberle & Marinelli (2011)); this is beyond the scope of the current article and our result above 
is illustrative (and hence potentially over-optimistic). 

4.3 Online Implementation 

A key characteristic that has not yet been addressed is that each sampler has a computational complexity
that increases with time. In a procedure that would otherwise be well suited to providing online
inference, this is an unattractive feature. A large contribution to this increasing computational budget
derives from the MCMC sweeps at the end of each iteration: as the space over which the invariant
MCMC kernel is applied grows, so does the expense of the algorithm. The computational demand of
the samplers can therefore be reduced by keeping the space over which the MCMC kernel is applied
constant. The reduced computational complexity (RCC) alternative to each of the samplers is designed
by amending the algorithms such that, at time $t_n$, the MCMC sweep operates over, at most, 20
changepoints, i.e. over the interval $[\phi_{k_{t_n}-19}, t_n)$. Due to the well-known path degeneracy
problem in SMC (see Kantas et al. (2011)), the estimates will be poor approximations of the true values
when including static parameters and extending the space of the point process for a long time. We note
that, at least for our application, it is reasonable to consider T fixed and thus this is less problematic.
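The windowing rule above can be illustrated with a short sketch. It assumes, purely for illustration, that a particle stores its changepoints as a sorted list of times; the function name `rcc_window_start` and the choice to sweep over the whole of $[0, t_n)$ when fewer than 20 changepoints exist are assumptions, not details taken from the paper.

```python
from bisect import bisect_left


def rcc_window_start(changepoints, t_n, max_points=20):
    """Left end of the interval over which the MCMC sweep operates.

    changepoints: sorted changepoint times of one particle (hypothetical
    representation). Only changepoints strictly before t_n are counted.
    """
    # Number of changepoints lying in [0, t_n).
    k = bisect_left(changepoints, t_n)
    if k <= max_points:
        # Assumption: with at most max_points changepoints, sweep over [0, t_n).
        return 0.0
    # The max_points-th most recent changepoint (0-based index k - max_points),
    # so that the window [start, t_n) contains exactly max_points changepoints.
    return changepoints[k - max_points]


# Example: 30 changepoints at times 1, 2, ..., 30 and current time 31.
start = rcc_window_start(list(range(1, 31)), 31)
```

With this rule the cost of each sweep is bounded by the number of changepoints in the window rather than by the length of the whole path, which is the source of the savings and also of the path degeneracy noted above.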

5 The Finance Problem Revisited 

We now return to the example from Section 2 and the settings as in Section 3.1.2.

5.1 Simulated Data

The saturated and tempered samplers, as well as their RCC alternatives, were implemented using the
simulated data set (Section 3.1.2) in order to compare their respective performances against the benchmark
sampler and to compare the accuracy of the resulting intensity estimates against an observed intensity
process. All of the alternative samplers were implemented under the same conditions, using the algorithm
and model parameters as described for the implementation of the benchmark sampler. All results are
averaged over 10 runs of the algorithm.

In assessing the performance of the sampler, quantities of interest are, once again, the resampling rate
and the processing time, as well as the minimum ESS recorded throughout the execution of the sampler.
The resampling rates for all three samplers and their RCC alternatives are presented in Table 1, with
the corresponding minimum ESSs recorded in Table 2 and the corresponding processing times
in Table 3. Figure 3 displays the evolution of the ESS over a particular run of the algorithm. Figure 4
shows the estimated intensity at each time $t_n$, given data up to time $t_n$. From Table 1 it is clear
that, for the saturated and tempered samplers, an increase in M results in a decrease in the resampling
rates, i.e. a decrease in sampler degeneracy, as expected. It is also plain from Table 2 that, as N
increases, so does the minimum ESS, and thus the reliability of the estimates. From Tables 1 and 2, Figure 4,
and comparing Figure 3 to Figure 2, it is clear that the saturated and tempered samplers significantly
outperformed the benchmark sampler.
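The two diagnostics used in this comparison, the ESS and the resampling rate, can be computed as in the following sketch. It assumes the standard ESS definition (the reciprocal of the sum of squared normalised weights) and a resampling threshold of N/2, as indicated in the caption of Figure 3; the function names and the idea of recording one weight vector per iteration before any resampling are illustrative assumptions.

```python
def effective_sample_size(weights):
    """ESS = 1 / sum of squared normalised weights (standard definition)."""
    total = sum(weights)
    normalised = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalised)


def resampling_summary(weight_history, threshold_fraction=0.5):
    """Return (resampling rate, minimum ESS) over a run.

    weight_history: one list of unnormalised particle weights per iteration,
    recorded before any resampling at that iteration (an assumption here).
    """
    n_resampled = 0
    min_ess = float("inf")
    for weights in weight_history:
        ess = effective_sample_size(weights)
        min_ess = min(min_ess, ess)
        # Resample whenever the ESS drops below the threshold (N/2 by default).
        if ess < threshold_fraction * len(weights):
            n_resampled += 1
    return n_resampled / len(weight_history), min_ess


rate, min_ess = resampling_summary([[1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 0.0, 0.0]])
```

A minimum ESS of 1.0, as reported for the benchmark sampler in Table 2, corresponds to all of the weight sitting on a single particle at some iteration.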

We use the posterior medians to report intensities. Since we have access to a 'true' intensity process, the
accuracy of these estimated intensity processes is measured using the root mean square error (RMSE). Table
4 presents the RMSEs of the intensity estimates (given the data up to $t_n$, averaged over each $t_n$) and Table
5 presents the RMSEs of the smoothed (conditional upon the entire data set) intensity estimates resulting
from each of the three samplers and their RCC alternatives. The most important result to note is the
performance of the saturated and tempered samplers in comparison with the unaltered sampler. As can be
seen in terms of accuracy for intensity estimates, the two proposed alterations to the sampler improve the
performance consistently and significantly. Looking at the resampling rates and processing times, in Tables
1 and 3 respectively, we can see that, as expected, although the tempered sampler resampled the particles

                   M = 1              M = 5              M = 20
                   N=100    N=1000    N=100    N=1000    N=100    N=1000
Benchmark          31.3%    52.0%     42.3%    94.4%     74.0%    99.7%
Benchmark - RCC    37.6%    88.1%     69.0%    99.7%     99.4%    99.7%
Saturated          21.0%    21.3%     19.7%    20.1%     18.2%    17.6%
Saturated - RCC    20.7%    20.7%     18.5%    18.8%     15.4%    15.4%
Tempered            2.0%     2.0%      1.9%     1.9%      1.7%     1.7%
Tempered - RCC      2.0%     2.0%      1.7%     1.8%      1.4%     1.4%

Table 1: Table showing the resampling rates of each of the three SMC samplers and their reduced
computational complexity alternatives, for the six algorithm parameterisations that were tested. The ESS plots
for the saturated and tempered samplers with N = 1000, M = 5 are given in Figure 3 for comparison with
the corresponding ESS plot for the benchmark sampler given in Figure 2.





                   M = 1              M = 5              M = 20
                   N=100    N=1000    N=100    N=1000    N=100    N=1000
Benchmark            1.0      1.0       1.0      1.0       1.0      1.0
Benchmark - RCC      1.0      1.0       1.0      1.0       1.0      1.0
Saturated           38.1    410.2      38.6    397.0      38.6    398.9
Saturated - RCC     38.5    401.2      40.6    394.4      43.0    425.9
Tempered            47.6    484.7      47.7    475.5      47.9    483.4
Tempered - RCC      47.8    475.7      48.4    481.7      48.3    486.6

Table 2: Table showing the minimum ESS encountered during implementation by each of the three SMC
samplers and their reduced computational complexity alternatives, for the six algorithm parameterisations
that were tested.

                   M = 1                M = 5                 M = 20
                   N=100     N=1000     N=100     N=1000      N=100      N=1000
Benchmark           612.9     9689.1    2849.7    45690.4    13352.1    144621.3
Benchmark - RCC     449.0     7910.9    1132.7    10657.6     3106.2     31208.5
Saturated          1125.3    10667.8    3234.3    39061.1    15381.9    141817.3
Saturated - RCC     637.5     6215.2    1200.7    11412.6     4391.9     47662.8
Tempered           1160.2    10633.4    3138.4    38679.6    14086.7    130899.1
Tempered - RCC      666.0     6424.4    1156.3    11209.1     3231.3     34795.3

Table 3: Table showing the processing time, in seconds, for each of the three samplers and their reduced
computational complexity alternatives, for the six algorithm parameterisations that were tested.



(a) Saturated    (b) Tempered    [horizontal axis: Time]

Figure 3: Effective Sample Size plots for the SMC samplers with state space saturation (left) and data
point tempering (right), run with N = 1000 particles and with M = 5 MCMC sweeps at each iteration.
The dashed line indicates the resampling threshold at N/2 = 500 particles; the corresponding resampling
rates are 20.1% for the saturated sampler and 1.9% for the tempered sampler.



significantly less than the benchmark sampler, the individual incorporation of each data point resulted in 
a greater computational cost. These two aspects of the benchmark and tempered samplers appear to have 
countered each other, resulting in their processing times being largely similar. 

We consider also the effect that changes in M and N have on the accuracy of estimates provided by
the saturated and tempered samplers. For the saturated and tempered samplers, the results in Tables 4
and 5 corroborate the expected improvement in accuracy, both for the sequential estimates at $t_n$ given
data up to $t_n$ and for the smoothed estimates (given the entire data), that results from an increase in the number
of particles used. Whilst for the sequential estimates there is no clear improvement in accuracy with
increasing M, an improvement can be seen in the accuracy of the smoothed estimates.
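As a sketch of the accuracy measure behind Tables 4 and 5: at each time point the intensity is summarised by a posterior median of the weighted particles, and the resulting sequence is compared with the true intensity via the RMSE. The weighted-median rule used here (smallest value whose cumulative weight reaches half of the total) and the function names are assumptions made for illustration; the paper does not specify how the median is computed.

```python
import math


def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total weight."""
    pairs = sorted(zip(values, weights))
    half = 0.5 * sum(weights)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    return pairs[-1][0]  # guard against floating-point round-off


def rmse(estimates, truth):
    """Root mean square error between two equally long sequences."""
    squared = [(e - t) ** 2 for e, t in zip(estimates, truth)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical particle intensities and weights at a single time point.
median_intensity = weighted_median([10.0, 20.0, 30.0], [0.1, 0.1, 0.8])
```

Averaging such squared errors over the time points $t_n$ gives one entry of Table 4 when each estimate uses data up to $t_n$, and one entry of Table 5 when each estimate conditions on the entire data set.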





                   M = 1                 M = 5                 M = 20
                   N=100      N=1000     N=100      N=1000     N=100      N=1000
Benchmark          688.561    1116.639   620.432    1942.992   1330.232   1501.263
Benchmark - RCC    676.932    2026.956   880.824    2247.313   1472.126   1264.533
Saturated          242.834     192.580   228.390     193.778    237.315    198.223
Saturated - RCC    229.449     189.279   224.692     193.379    225.592    194.623
Tempered           254.396     196.928   247.754     201.681    248.367    202.501
Tempered - RCC     256.012     191.407   227.241     197.043    230.805    200.227

Table 4: Table showing the root mean square error of the intensity. This is given the data up to $t_n$, averaged
over each $t_n$ and for each of the three samplers and their reduced computational complexity alternatives,
for the six algorithm parameterisations that were tested.





                   M = 1                 M = 5                 M = 20
                   N=100      N=1000     N=100      N=1000     N=100      N=1000
Benchmark          768.702     670.656   495.019    627.909    489.243    571.107
Benchmark - RCC    698.640    1034.890   572.794    572.841    535.004    599.031
Saturated          360.794     264.331   296.953    114.064    153.444     89.397
Saturated - RCC    478.871     265.477   405.767    266.980    468.853    205.243
Tempered           350.015     170.321   271.712    128.078    157.709     81.666
Tempered - RCC     485.825     249.529   475.348    193.898    514.107    180.914

Table 5: Table showing the smoothed root mean square error of the intensity. This is given the entire data
set and for each of the three samplers and their reduced computational complexity alternatives, for the six
algorithm parameterisations that were tested.

Finally, using the simulated data, we consider the performance of the samplers when limiting the space
over which the invariant MCMC kernels are applied, i.e. the RCC alternatives. As can be seen from Table
4, the RCC alteration does not sacrifice any accuracy in the estimates of the intensity (given the data up to
each time $t_n$); however, it can be seen from Table 5 that the accuracy of the smoothed intensity estimates
is rather poor. This is to be expected, due to path degeneracy; we note that one cannot estimate static
parameters with the RCC approach unless the time window T is quite small.

(a) Benchmark    (b) Saturated    (c) Tempered

Figure 4: Estimates (given the data up to $t_n$) of the intensity of a simulated data set, generated by
the benchmark SMC sampler (left) and the samplers with state space saturation (centre) and data point
tempering (right), run with N = 1000 particles and with M = 5 MCMC sweeps at each iteration. The
model parameters were $\gamma = 0.001$, $\nu = 150$ and $s = 20$.

5.2 Real Data 

All three samplers were also tested on real financial data, with the RCC alternatives also being used to
generate intensity estimates, given the data up to $t_n$. The share price of ARM Holdings, plc, traded on the
LSE was used. The entire data set was of size 1819, with $[0, T] = [0, 0.3]$ (representing 3/10 of a trading day,
that is, 3/10 of 24 hours; the first trade is just after 9am and the last around 16:15) and $t_n = n \times 0.001$.
Genuine financial data is likely to correspond to a more volatile latent intensity process than that which
was used to generate the synthetic data set, and so the parameterisation of the target posterior should
be chosen such that large jumps in the intensity process are possible, and such that the intensity may
also revert quickly to a lower intensity level. Hence, we specify: $\{\gamma, \nu, s\} = \{0.001, 500, 250\}$. Each of the
samplers was run using N = 1000 particles, applying M = 5 MCMC sweeps at each iteration, whilst the
resampling rates and the minimum ESS obtained for each procedure were monitored to ensure that the
algorithms did not collapse.

                   RMSPE               RMSPE         Processing    Resampling
                   (data up to t_n)    (smoothed)    time (s)      rate
Saturated          2.18876             2.13479       4064.5        39.5%
Saturated - RCC    2.19112             -             2193.1        39.9%
Tempered           2.34671             2.11468       4605.5        19.8%
Tempered - RCC     2.42776             -             2237.3        19.9%

Table 6: Table showing the root mean square prediction errors for the intensity estimates (given data
up to time $t_n$ and entire data (smoothed)) given by each of the three samplers for the parameter values
N = 1000, M = 5. The RMSPEs for the smoothed intensity estimates given by the RCC alternatives to
the samplers are also provided, along with the observed processing times and resampling rates for each
sampler.

Clearly, there is no 'known' intensity process against which to compare the point-wise estimates
produced by the samplers. In addition, any inverse-duration based representation of the intensity against
which useful comparisons could be drawn would involve making assumptions on the smoothness of the
intensity process itself. Thus, we turn to measuring the one-step-ahead predictive accuracy of the
estimators of the intensity. This is achieved as follows: denoting the intensity estimated over the interval
$[t_{n-j}, t_n)$ as $\lambda_{n,j}$, one predicts the expected number of ticks in the interval $[t_{n+i-j}, t_{n+i})$
from $\lambda_{n,j}$, scaled by the length of that interval, for $i \geq 1$ and $j \geq 1$, where j is the number of
periods over which the prediction is made and i is a lag index.
The prediction errors are then calculated based on the predicted and observed number of ticks in the
period $[t_{n+i-j}, t_{n+i})$; the root mean square prediction error (RMSPE) will be used. We will report on the
one-step-ahead estimates (i = 1), estimating the intensity over each interval with j = 1.
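A minimal sketch of this prediction-error calculation follows. It assumes an equally spaced grid with spacing `delta` and takes the predicted number of ticks to be the estimated intensity multiplied by the length of the prediction interval; that scaling, the function name `rmspe` and the list-based data layout are assumptions made for illustration.

```python
import math


def rmspe(intensity, counts, delta, lag=1, periods=1):
    """Root mean square prediction error on an equally spaced grid.

    intensity[n - 1]: estimate of the intensity over [t_{n-periods}, t_n).
    counts[k]:        observed number of ticks in [t_k, t_{k+1}).
    delta:            grid spacing t_{k+1} - t_k.
    """
    m = len(counts)
    squared_errors = []
    # n runs over 1-based grid indices with n >= periods and n + lag <= m.
    for n in range(periods, m - lag + 1):
        # Assumption: expected ticks = intensity * length of the interval.
        predicted = intensity[n - 1] * periods * delta
        # Ticks observed in [t_{n+lag-periods}, t_{n+lag}).
        observed = sum(counts[n + lag - periods:n + lag])
        squared_errors.append((predicted - observed) ** 2)
    if not squared_errors:
        raise ValueError("not enough intervals for this lag and period")
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy example with three intervals of length 0.001.
error = rmspe([1000.0, 2000.0, 3000.0], [1, 2, 3], delta=0.001)
```

Setting `lag=1` and `periods=1` corresponds to the one-step-ahead case reported in Table 6; larger values of `lag` give the longer-range comparisons mentioned in the text.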

Table 6 presents the RMSPEs for the intensity estimates resulting from the samplers and the RCC
alternatives. It was observed that, in calculating the RMSPEs for lag indices $i = 1, \dots, 100$ using each
sampler, both the saturated and tempered samplers displayed the smallest error at i = 1, i.e. their respective
one-step-ahead predictions were more accurate than those made for lags up to 2.64 hours (each observation
interval corresponds to 0.0264 hours = 1.584 minutes).

The RCC samplers provide significant computational savings and do not seem to degrade substantially, 
w.r.t. the error criteria. Again, we remark that in general one should not trust the estimates of the RCC, 
but as seen here, they can provide a guideline for the intensity values. 

6 Summary 

In this paper we have considered SMC simulation for partially observed point processes and implemented
it for a particular doubly stochastic PP. Two solutions were given: one based upon saturating the
state-space, which is suitable in a wide variety of applications, and data-point tempering, which can be used
in sequential problems. We also discussed RCC versions of these algorithms, which reduce computation
but will be subject to the path degeneracy problem when including static parameters and considering the
smoothing distribution. We saw that the methods can be successful, in terms of weight degeneracy, versus
the benchmark approach detailed in Del Moral et al. (2007). In addition, for real data it was observed
that predictions using the RCC could be reasonable (relative to the normal versions of the algorithms),
but these estimates should be used with caution.

The methodology we have presented is not online. As we have seen, when one modifies the approaches
to have fixed computational complexity, the path degeneracy problem occurs and one cannot deal with
scenarios with static parameters. In this case, we are working with Dr. N. Whiteley on a technique based
upon fixed window filtering. This is an on-line algorithm which allows data to be incorporated as they
arrive with computational cost which is non-increasing over time, but is biased. The approach involves
sampling from a sequence of distributions which are constructed such that, at time $t_n$, previously sampled
events in $[0, t_{n-\ell}]$ can be discarded. In order to be exact (in the sense of targeting the true posterior
distributions), this scheme would involve point-wise evaluation of an intractable density. We are working
on a sensible approximation of this density, at the cost of introducing a small bias.

Appendix

Proposition 1

In this appendix we give a proof of Proposition 1. For a probability measure $\varpi$ and function $f$, $\varpi(f) := \int f(x)\varpi(dx)$. For any collection of points $(x_{n-1}^{(1)}, \dots, x_{n-1}^{(N)}) \in E_{n-1}^N$ write

$$S^N_{n-1}(dx) := \frac{1}{N}\sum_{i=1}^{N} \delta_{x_{n-1}^{(i)}}(dx).$$

The transition kernels are written $K_1$ (which is not to be confused with the $K_1$ from the SMC samplers
algorithm) and for any $n \geq 2$, $N \geq 1$, $N$-empirical density $S^N_{n-1}$, $K_{n,S^N_{n-1}}$ is the kernel of invariant
distribution

$$\frac{g_{(t_{n-1},t_n]}(y_{n,1}; x_n)}{p_n(y_{n,1}|y_{n-1})}\, p(x_{n,1})\, S^N_{n-1}(x_{n-1}).$$

Recall the generic notation $x_n \in E_n$. We drop the dependence upon the data and denote

$$g_n(x_n) := \frac{g_{(t_{n-1},t_n]}(y_{n,1}; x_n)}{p_n(y_{n,1}|y_{n-1})}\, p(x_{n,1}). \qquad (11)$$

The $N$-empirical measure of points generated up to time $n - 1$ is written $S^N_{n-1}$. For a given $n \geq 1$,
$f_n: E_n \to \mathbb{R}$ we have the notation $K_{n,S^N_{n-1}}(f_n)(x) := \int f_n(y) K_{n,S^N_{n-1}}(x, dy)$ and, for $i \geq 1$,
$K^i_{n,S^N_{n-1}}(f_n)(x) := \int K^{i-1}_{n,S^N_{n-1}}(x, dy) K_{n,S^N_{n-1}}(f_n)(y)$, with $K^0_{n,S^N_{n-1}}(x, dy) = \delta_x(dy)$ the Dirac measure. The $\sigma$-finite measure $dx_{n,1}$
is defined on the space $E_n \setminus E_{n-1}$; in practice it is the product of an appropriate version of Lebesgue and
counting measures.

The following assumption is made. 

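Assumption (A). There exist an $\epsilon_1 \in (0, 1)$ and probability measure $\kappa_1$ on $E_1$ such that for
any $x_1 \in E_1$

$$K_1(x_1, \cdot) \geq \epsilon_1 \kappa_1(\cdot).$$

For any $n \geq 2$, there exist an $\epsilon_n \in (0, 1)$ and probability measure $\kappa_n$ on $E_n \setminus E_{n-1}$ such that
for any $x_n \in E_n$ and any collection of points $(x_{n-1}^{(1)}, \dots, x_{n-1}^{(N)}) \in E_{n-1}^N$, the kernel
$K_{n,S^N_{n-1}}(x_n, \cdot)$ satisfies the analogous minorization condition with constant $\epsilon_n$ and measure $\kappa_n$.

For any $n \geq 2$

$$\sup_{x_{n-1} \in E_{n-1}} \int |g_n(x_{n-1}, x_{n,1})|\, dx_{n,1} < +\infty$$

where $g_n$ is as in (11).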

It should be noted that the uniform ergodicity assumption on $K_{n,S^N_{n-1}}(x_n, \cdot)$ is quite strong. If the kernel
$K_{n,S^N_{n-1}}$ were a Metropolis-Hastings independence sampler of proposal $S^N_{n-1}(\cdot) \times q_n(\cdot)$, with $x_n = (x_{n-1}, x_{n,1})$,
then the resulting kernel, with acceptance probability

$$1 \wedge \left[\frac{g_n(y_n)\, q_n(x_{n,1})}{g_n(x_n)\, q_n(y_{n,1})}\right],$$

satisfies the assumption if $q_n(x_{n,1}) / g_n(x_n)$ is uniformly lower-bounded. Note also, due to the suppression
of the data from the notation, it is typical that $\epsilon_n$ would depend upon $y_n$.
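To make the independence-sampler remark concrete, here is a generic sketch of an independence Metropolis-Hastings chain. It is not the kernel of the paper: the target and proposal are passed in as arbitrary log-density functions, and all names (`independence_mh`, `log_target`, `log_proposal`, `sample_proposal`) are hypothetical. The acceptance probability is the minimum of 1 and the ratio of importance weights, target over proposal, at the proposed and current states.

```python
import math
import random


def independence_mh(log_target, log_proposal, sample_proposal, x0, n_steps, rng):
    """Generic independence Metropolis-Hastings chain (illustrative sketch).

    log_target, log_proposal: log densities, each known up to a constant.
    sample_proposal(rng): draws a proposal independently of the current state.
    Returns the chain and the fraction of accepted proposals.
    """
    x = x0
    log_w_x = log_target(x) - log_proposal(x)  # log weight of the current state
    chain, accepted = [], 0
    for _ in range(n_steps):
        y = sample_proposal(rng)
        log_w_y = log_target(y) - log_proposal(y)
        # Accept with probability min(1, w(y) / w(x)).
        if rng.random() < math.exp(min(0.0, log_w_y - log_w_x)):
            x, log_w_x = y, log_w_y
            accepted += 1
        chain.append(x)
    return chain, accepted / n_steps


# Toy example: target density proportional to 1 + x on [0, 1), uniform proposal.
rng = random.Random(1)
chain, rate = independence_mh(
    log_target=lambda x: math.log(1.0 + x),
    log_proposal=lambda x: 0.0,
    sample_proposal=lambda r: r.random(),
    x0=0.5,
    n_steps=2000,
    rng=rng,
)
```

In the toy example the ratio of proposal to target density is bounded below by 1/2, which is the kind of uniform lower bound under which an independence sampler is uniformly ergodic.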

Proof. The proof is inductive on n. Some details are omitted as the proof is quite similar to the control of 
adaptive MCMC chains, e.g. Andrieu et al. (2011). It should be noted the proof for this algorithm differs 
as the kernel possesses an invariant measure that does not change with the iteration $i \in \{1, \dots, N\}$.

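Let $n = 1$; then, by (A), $K_1$ is a uniformly ergodic Markov kernel of invariant measure $\pi_1$. It is
simple to use the Poisson equation to prove the proposition, which is given to establish the induction. Let
$\hat{f}_1(x_1) = \sum_{l \geq 0} [K_1^l(f_1)(x_1) - \pi_1(f_1)]$ be the solution to the Poisson equation $\hat{f}_1 - K_1(\hat{f}_1) = f_1 - \pi_1(f_1)$.
Then

$$\sum_{i=1}^{N} [f_1(x_1^{(i)}) - \pi_1(f_1)] = \sum_{i=1}^{N} [\hat{f}_1(x_1^{(i)}) - K_1(\hat{f}_1)(x_1^{(i)})]$$
$$= \sum_{i=1}^{N-1} [\hat{f}_1(x_1^{(i+1)}) - K_1(\hat{f}_1)(x_1^{(i)})] + \hat{f}_1(x_1^{(1)}) - K_1(\hat{f}_1)(x_1^{(N)});$$

the first quantity on the R.H.S. is a martingale, $M_N$, w.r.t. the natural filtration (i.e. the $\sigma$-algebra generated
by the Markov chain). Then, using the Minkowski inequality

$$\mathbb{E}_{x_1^{(1)}}\Big[\Big|\frac{1}{N}\sum_{i=1}^{N} [f_1(x_1^{(i)}) - \pi_1(f_1)]\Big|^p\Big]^{1/p} \leq \frac{1}{N}\Big(\mathbb{E}_{x_1^{(1)}}\big[|M_N|^p\big]^{1/p} + |\hat{f}_1(x_1^{(1)})| + \mathbb{E}_{x_1^{(1)}}\big[|K_1(\hat{f}_1)(x_1^{(N)})|^p\big]^{1/p}\Big).$$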



The last term can be dealt with as follows. 



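$$\mathbb{E}_{x_1^{(1)}}\big[|K_1(\hat{f}_1)(x_1^{(N)})|^p\big]^{1/p} \leq \mathbb{E}_{x_1^{(1)}}\big[K_1(|\hat{f}_1|^p)(x_1^{(N)})\big]^{1/p} \leq \|f_1\| \sup_{x \in E_1} \sum_{l=0}^{\infty} \Big|[K_1^l - \pi_1]\Big(\frac{f_1}{\|f_1\|}\Big)(x)\Big| \leq \frac{\|f_1\|}{\epsilon_1};$$

here we have applied the conditional Jensen inequality and the bound on the total variation distance for
uniformly ergodic Markov chains: $\forall x \in E_1$, $\sup_{f: E_1 \to [0,1]} |K_1^l(f)(x) - \pi_1(f)| \leq (1 - \epsilon_1)^l$. Note that this
bound holds for any $x_1 \in E_1$. The martingale term is bounded using the Burkholder and Davis inequalities
(i.e. the inequality below holds for any $p \geq 1$):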

$$\mathbb{E}_{x_1^{(1)}}\big[|M_N|^p\big]^{1/p} \leq B_p\, \mathbb{E}_{x_1^{(1)}}\Big[\Big(\sum_{i=1}^{N-1} \big[\hat{f}_1(x_1^{(i+1)}) - K_1(\hat{f}_1)(x_1^{(i)})\big]^2\Big)^{p/2}\Big]^{1/p}.$$

When $p \geq 2$ the Minkowski inequality and the above manipulations yield a bound $\sqrt{N} B(p, \epsilon_1)\|f_1\|$, with
$B(p, \epsilon_1)$ a constant only depending upon $p$ and $\epsilon_1$. When $p \in [1, 2)$ the inequality $(a - b)^2 \leq 2(a^2 + b^2)$
for $a, b \in \mathbb{R}$ is applied, then Jensen, to yield a similar bound; see Andrieu et al. (2011) and the references
therein. Thus, for $n = 1$ it follows that $\mathbb{E}_{x_1^{(1)}}[|M_N|^p]^{1/p} \leq \sqrt{N} B(p, \epsilon_1)\|f_1\|$; note that $B(p, \epsilon_1)$ depends only on
$\epsilon_1$ and $p$, which is important in the sequel. Putting these bounds together and noting that, by the above
arguments, the solution to the Poisson equation is uniformly bounded in $x$, the proof at rank $n = 1$ is
completed.
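The uniform bound on the solution of the Poisson equation invoked above can be spelled out as a short worked step. This is a sketch under the minorisation condition in (A); the oscillation $\mathrm{osc}(f_1) := \sup_x f_1(x) - \inf_x f_1(x)$ is notation introduced here for illustration and does not appear in the original argument, so the constant may differ from the one used in the proof by a factor of at most 2.

```latex
\begin{align*}
|\hat f_1(x_1)|
  &\le \sum_{l \ge 0} \bigl| K_1^l(f_1)(x_1) - \pi_1(f_1) \bigr| \\
  &\le \mathrm{osc}(f_1) \sum_{l \ge 0} (1 - \epsilon_1)^l
   = \frac{\mathrm{osc}(f_1)}{\epsilon_1}
   \le \frac{2\,\|f_1\|}{\epsilon_1}.
\end{align*}
```

The second line rescales $f_1$ to take values in $[0, 1]$, applies the total variation bound $(1 - \epsilon_1)^l$ term by term, and sums the geometric series; the resulting bound holds for every $x_1 \in E_1$.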

Now assume the result at $n - 1$ and consider $n$. Note that via Fubini

$$\pi_n(f_n) = \int_{E_n} f_n(x_n) g_n(x_n) \pi_{n-1}(dx_{n-1})\, dx_{n,1} = \int_{E_{n-1}} I(f_n \times g_n)(x_{n-1})\, \pi_{n-1}(dx_{n-1})$$

where $I(f_n \times g_n)(x_{n-1}) = \int_{E_n \setminus E_{n-1}} f_n(x_{n-1}, x_{n,1}) g_n(x_{n-1}, x_{n,1})\, dx_{n,1}$. Then application of the Minkowski
inequality yields:



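$$\mathbb{E}_{x_1^{(1)}}\Big[\Big|\frac{1}{N}\sum_{i=1}^{N} f_n(x_n^{(i)}) - \pi_n(f_n)\Big|^p\Big]^{1/p} \leq \mathbb{E}_{x_1^{(1)}}\Big[\Big|\frac{1}{N}\sum_{i=1}^{N} f_n(x_n^{(i)}) - S^N_{n-1}(I(f_n \times g_n))\Big|^p\Big]^{1/p}$$
$$+ \mathbb{E}_{x_1^{(1)}}\Big[\Big|[S^N_{n-1} - \pi_{n-1}](I(f_n \times g_n))\Big|^p\Big]^{1/p}. \qquad (12)$$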

Due to the induction hypothesis and (A), the second term on the R.H.S. of the inequality is upper-bounded
by

$$\frac{B_{p,n-1} \sup_{x_{n-1} \in E_{n-1}} I(|f_n \times g_n|)(x_{n-1})}{\sqrt{N}} \leq \frac{B_{p,n}\|f_n\|}{\sqrt{N}}$$

for some $B_{p,n} < +\infty$; if the data were not suppressed, then there is an explicit dependence upon this
quantity. Then considering the first term on the R.H.S. of (12), conditioning upon the $\sigma$-algebra $\mathcal{F}_1^N \otimes \cdots \otimes \mathcal{F}_{n-1}^N$,
the process generated at time $n$ is a uniformly ergodic Markov chain of invariant distribution
$S^N_{n-1}(dx_{n-1}) g_n(x_n)\, dx_{n,1}$. Thus, for example:

$$\mathbb{E}\Big[\big|K_{n,S^N_{n-1}}(\hat{f}_n)(x_n^{(N)})\big|^p \,\Big|\, \mathcal{F}_1^N \otimes \cdots \otimes \mathcal{F}_{n-1}^N\Big]^{1/p} \leq \frac{\|f_n\|}{\epsilon_n}$$

adopting exactly the above arguments. Noting that the bound on the conditional expectation is
deterministic, i.e. does not depend upon $\mathcal{F}_1^N \otimes \cdots \otimes \mathcal{F}_{n-1}^N$, the induction is easily completed. □



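Proof of Proposition 2

For the proof of Proposition 2 we require a round of notations. We write $\bar{x}_p = (x_p, x'_p) \in E_1^2$ and define
the following quantities:

$$G_p(\bar{x}_p) = \frac{\pi_p(x_p)}{\pi_{p-1}(x_p)}, \qquad 1 \leq p \leq r_1,$$

with $G_0(\bar{x}_0) = 1$. In addition, set $\eta_0(\cdot) = p(\cdot)$ and

$$M_p(\bar{x}_{p-1}, d\bar{x}_p) = \delta_{x'_{p-1}}(dx_p) K_p(x_p, dx'_p), \qquad 1 \leq p \leq r_1.$$

We add an extra Markov kernel to allow us to use directly formulae in Del Moral (2004): $M_{r_1+1}(\bar{x}_{r_1}, d\bar{x}_{r_1+1}) = \delta_{\bar{x}_{r_1}}(d\bar{x}_{r_1+1})$. Then we define

$$\eta_p(d\bar{x}_p) = \frac{\int_{(E_1^2)^{p}} \eta_0(d\bar{x}_0) \prod_{q=1}^{p} G_{q-1}(\bar{x}_{q-1}) M_q(\bar{x}_{q-1}, d\bar{x}_q)}{\int_{(E_1^2)^{p+1}} \eta_0(d\bar{x}_0) \prod_{q=1}^{p} G_{q-1}(\bar{x}_{q-1}) M_q(\bar{x}_{q-1}, d\bar{x}_q)}, \qquad 1 \leq p \leq r_1 + 1.$$

In addition $Q_p(\bar{x}_{p-1}, d\bar{x}_p) = G_{p-1}(\bar{x}_{p-1}) M_p(\bar{x}_{p-1}, d\bar{x}_p)$, $1 \leq p \leq r_1 + 1$, with

$$Q_{p,n}(\bar{x}_p, d\bar{x}_n) = \int Q_{p+1}(\bar{x}_p, d\bar{x}_{p+1}) \cdots Q_n(\bar{x}_{n-1}, d\bar{x}_n), \qquad 1 \leq p \leq n \leq r_1 + 1,$$

with the convention that $Q_{p,p}$ is the identity operator. Also define $P_{p,n}(\bar{x}_p, d\bar{x}_n) = Q_{p,n}(\bar{x}_p, d\bar{x}_n) / Q_{p,n}(1)(\bar{x}_p)$
and finally

$$\bar{Q}_{p,n}(\bar{x}_p, d\bar{x}_n) = \frac{Q_{p,n}(\bar{x}_p, d\bar{x}_n)}{\eta_p(Q_{p,n}(1))}.$$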

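Proof of Proposition 2. We have from Proposition 9.4.2 of Del Moral (2004) that:

$$\sigma^2(f) = \sum_{p=0}^{r_1+1} \eta_p\Big(\big[\bar{Q}_{p,r_1+1}(f - \eta_{r_1+1}(f))\big]^2\Big). \qquad (13)$$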



The objective is to re-write the summand in terms of a difference $P_{p,r_1+1}(\bar{x}, \cdot) - P_{p,r_1+1}(\bar{x}', \cdot)$ and use the
mixing conditions to control the Dobrushin coefficient of the kernel $P_{p,r_1+1}$; see e.g. Del Moral et al. (2012)
section 4. To that end, we need only consider the first $r_1 - 1$ terms, for which the Dobrushin coefficient will
satisfy:

$$\beta(P_{p,r_1+1}) := \sup_{\bar{x},\bar{x}'} \|P_{p,r_1+1}(\bar{x}, \cdot) - P_{p,r_1+1}(\bar{x}', \cdot)\|_{tv} \leq (1 - \rho)^{\lfloor (r_1+1-p)/2 \rfloor}, \qquad r_1 - p \geq 1, \qquad (14)$$

for some $\rho \in (0, 1)$ that does not depend upon $r_1$ and $\|\cdot\|_{tv}$ the total variation distance (again see Del
Moral et al. (2012), as the condition (A4)2 of that paper is satisfied). The remainder of the terms in the
sum are easily bounded, independently of $r_1$, and we omit these calculations.
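For intuition about the quantity being controlled, the Dobrushin coefficient of a Markov kernel on a finite state space can be computed directly. The sketch below is an illustration only (the kernels in the proof live on general spaces); it uses the convention that the total variation distance between two probability vectors is half their L1 distance, and the function name is hypothetical.

```python
def dobrushin_coefficient(P):
    """Dobrushin coefficient of a row-stochastic matrix P (list of rows).

    Defined as the largest total variation distance between two rows, with
    total variation taken as half the L1 distance.
    """
    beta = 0.0
    for row_a in P:
        for row_b in P:
            tv = 0.5 * sum(abs(a - b) for a, b in zip(row_a, row_b))
            beta = max(beta, tv)
    return beta


# Two-state example: every row dominates 0.3 times the measure (2/3, 1/3).
beta = dobrushin_coefficient([[0.9, 0.1], [0.2, 0.8]])
```

If every row of P dominates $\epsilon$ times a common probability vector, the coefficient is at most $1 - \epsilon$; a geometric bound of the kind in (14) arises from compounding such contractions over successive steps.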

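By using standard properties of Feynman-Kac formulae, we have that each summand in (13) is equal to

$$\eta_p\Big(\frac{Q_{p,r_1+1}(1)^2}{\eta_p(Q_{p,r_1+1}(1))^2}\Big[P_{p,r_1+1}(f) - \frac{\eta_p(Q_{p,r_1+1}(1) P_{p,r_1+1}(f))}{\eta_p(Q_{p,r_1+1}(1))}\Big]^2\Big).$$

By using Jensen's inequality, it follows that this is upper-bounded by

$$\int \eta_p(d\bar{x}) \frac{Q_{p,r_1+1}(1)(\bar{x})^2}{\eta_p(Q_{p,r_1+1}(1))^2} \int \eta_p(d\bar{x}') \frac{Q_{p,r_1+1}(1)(\bar{x}')}{\eta_p(Q_{p,r_1+1}(1))} \big[P_{p,r_1+1}(f)(\bar{x}) - P_{p,r_1+1}(f)(\bar{x}')\big]^2 \leq \|f\|^2 \beta(P_{p,r_1+1})^2 \frac{\eta_p(Q_{p,r_1+1}(1)^2)}{\eta_p(Q_{p,r_1+1}(1))^2}.$$

Using the fact that (see e.g. section 4 of Del Moral et al. (2012))

$$\sup_{\bar{x},\bar{x}'} \frac{Q_{p,r_1+1}(1)(\bar{x})}{Q_{p,r_1+1}(1)(\bar{x}')} \leq B$$

for a $B \in (0, +\infty)$ that does not depend on $r_1$ and using the bound in (14) we can conclude. □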